Biostat 200C Final

Due Fri. June 14, 2024 @ 11:59PM

Author

Ziheng Zhang (UID 606300061)

Published

June 7, 2024

This is a take-home, open-book final. Helping others or asking for help from others is considered plagiarism.

1 Q1. (25 pts) Survival data analysis

Consider the following survival times of 25 patients with no history of chronic disease (chr = 0) and 25 patients with a history of chronic disease (chr = 1).

  1. Manually fill in the missing information in the following tables of ordered failure times for groups 1 (chr = 0) and 2 (chr = 1). Explain how survival probabilities (last column) are calculated.

Answer:

Group 1 (chr = 0):

time n.risk n.event survival
1.8 25 1 0.96
2.2 24 1 0.92
2.5 23 1 0.88
2.6 22 1 0.84
3.0 21 1 0.80
3.5 20 1 0.76
3.8 19 1 0.72
5.3 18 1 0.68
5.4 17 1 0.64
5.7 16 1 0.60
6.6 15 1 0.56
8.2 14 1 0.52
8.7 13 1 0.48
9.2 12 2 0.40
9.8 10 1 0.36
10.0 9 1 0.32
10.2 8 1 0.28
10.7 7 1 0.24
11.0 6 1 0.20
11.1 5 1 0.16
11.7 4 4 0.00

Group 2 (chr = 1):

time n.risk n.event survival
1.4 25 1 0.96
1.6 24 1 0.92
1.8 23 1 0.88
2.4 22 1 0.84
2.8 21 1 0.80
2.9 20 1 0.76
3.1 19 1 0.72
3.5 18 1 0.68
3.6 17 1 0.64
3.9 16 1 0.60
4.1 15 1 0.56
4.2 14 1 0.52
4.7 13 1 0.48
4.9 12 1 0.44
5.2 11 1 0.40
5.8 10 1 0.36
5.9 9 1 0.32
6.5 8 1 0.28
7.8 7 1 0.24
8.3 6 1 0.20
8.4 5 1 0.16
8.8 4 1 0.12
9.1 3 1 0.08
9.9 2 1 0.04
11.4 1 1 0.00

The empirical survivor function, an estimate of the probability of survival beyond time \(y\), is \[ \widehat{S}(y) = \widehat{\mathbb{P}}(Y > y). \] Arrange the event times \(y_{(1)} \le y_{(2)} \le \cdots \le y_{(k)}\). The Kaplan-Meier (product-limit) formula is \[\begin{eqnarray*} \widehat{S}(y_{(k)}) &=& \widehat{\mathbb{P}}(Y > y_{(k)}) \\ &=& \widehat{\mathbb{P}}(Y > y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \widehat{S}(y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\ &=& \cdots \\ &=& \prod_{j=1}^k \widehat{\mathbb{P}}(Y > y_{(j)} \mid Y > y_{(j-1)}) \\ &=& \prod_{j=1}^k \left( \frac{n_j - d_j}{n_j} \right), \end{eqnarray*}\] where \[\begin{eqnarray*} n_j &=& \text{number of subjects who are alive just before time $y_{(j)}$}, \\ d_j &=& \text{number of deaths that occur at time $y_{(j)}$}. \end{eqnarray*}\] Because there is no censored data here, the product telescopes, and the survival probability in each row is simply \(\frac{n_j - d_j}{25}\): the number of subjects still alive after time \(y_{(j)}\) divided by the group size. For example, the survival probability at time 3.5 for group 1 is calculated as \(\frac{20 - 1}{25} = 0.76\).
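As a check, the product-limit computation for group 1 can be reproduced in a few lines of R, a sketch using only the n.event column from the table above:

```r
# Reproduce the group 1 survival column with the product-limit formula.
# With no censoring the product telescopes to (n.risk - n.event) / 25.
n_total <- 25
d <- c(rep(1, 13), 2, rep(1, 6), 4)            # n.event at each failure time
n_risk <- n_total - cumsum(c(0, head(d, -1)))  # at risk just before each time
surv <- cumprod((n_risk - d) / n_risk)         # Kaplan-Meier product
round(surv, 2)                                 # matches the survival column above
```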

  2. Use R to display the Kaplan-Meier survival curves for groups 1 (chr = 0) and 2 (chr = 1).

Answer:

chr_data <- data.frame(
  chr = c(rep(0, 25), rep(1, 25)),
  time = c(1.8, 2.2, 2.5, 2.6, 3.0, 3.5, 3.8, 5.3, 5.4, 5.7, 6.6, 8.2, 8.7, 9.2, 
           9.2,  9.8, 10.0, 10.2, 10.7, 11.0, 11.1, 11.7, 11.7, 11.7, 11.7, 1.4, 
           1.6, 1.8, 2.4, 2.8, 2.9, 3.1, 3.5, 3.6, 3.9, 4.1, 4.2, 4.7, 4.9, 5.2, 
           5.8, 5.9, 6.5, 7.8, 8.3, 8.4, 8.8, 9.1, 9.9, 11.4),
  cen = c(rep(1, 50))
)

library(survival)  # provides Surv() and survfit()
kmfit <- survfit(Surv(time, cen) ~ chr, data = chr_data)
summary(kmfit)
Call: survfit(formula = Surv(time, cen) ~ chr, data = chr_data)

                chr=0 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  1.8     25       1     0.96  0.0392       0.8862        1.000
  2.2     24       1     0.92  0.0543       0.8196        1.000
  2.5     23       1     0.88  0.0650       0.7614        1.000
  2.6     22       1     0.84  0.0733       0.7079        0.997
  3.0     21       1     0.80  0.0800       0.6576        0.973
  3.5     20       1     0.76  0.0854       0.6097        0.947
  3.8     19       1     0.72  0.0898       0.5639        0.919
  5.3     18       1     0.68  0.0933       0.5197        0.890
  5.4     17       1     0.64  0.0960       0.4770        0.859
  5.7     16       1     0.60  0.0980       0.4357        0.826
  6.6     15       1     0.56  0.0993       0.3956        0.793
  8.2     14       1     0.52  0.0999       0.3568        0.758
  8.7     13       1     0.48  0.0999       0.3192        0.722
  9.2     12       2     0.40  0.0980       0.2475        0.646
  9.8     10       1     0.36  0.0960       0.2135        0.607
 10.0      9       1     0.32  0.0933       0.1807        0.567
 10.2      8       1     0.28  0.0898       0.1493        0.525
 10.7      7       1     0.24  0.0854       0.1195        0.482
 11.0      6       1     0.20  0.0800       0.0913        0.438
 11.1      5       1     0.16  0.0733       0.0652        0.393
 11.7      4       4     0.00     NaN           NA           NA

                chr=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  1.4     25       1     0.96  0.0392      0.88618        1.000
  1.6     24       1     0.92  0.0543      0.81957        1.000
  1.8     23       1     0.88  0.0650      0.76141        1.000
  2.4     22       1     0.84  0.0733      0.70791        0.997
  2.8     21       1     0.80  0.0800      0.65761        0.973
  2.9     20       1     0.76  0.0854      0.60974        0.947
  3.1     19       1     0.72  0.0898      0.56386        0.919
  3.5     18       1     0.68  0.0933      0.51967        0.890
  3.6     17       1     0.64  0.0960      0.47698        0.859
  3.9     16       1     0.60  0.0980      0.43566        0.826
  4.1     15       1     0.56  0.0993      0.39563        0.793
  4.2     14       1     0.52  0.0999      0.35681        0.758
  4.7     13       1     0.48  0.0999      0.31919        0.722
  4.9     12       1     0.44  0.0993      0.28275        0.685
  5.2     11       1     0.40  0.0980      0.24749        0.646
  5.8     10       1     0.36  0.0960      0.21346        0.607
  5.9      9       1     0.32  0.0933      0.18071        0.567
  6.5      8       1     0.28  0.0898      0.14934        0.525
  7.8      7       1     0.24  0.0854      0.11947        0.482
  8.3      6       1     0.20  0.0800      0.09132        0.438
  8.4      5       1     0.16  0.0733      0.06517        0.393
  8.8      4       1     0.12  0.0650      0.04151        0.347
  9.1      3       1     0.08  0.0543      0.02117        0.302
  9.9      2       1     0.04  0.0392      0.00586        0.273
 11.4      1       1     0.00     NaN           NA           NA
library(ggfortify)  # autoplot() method for survfit objects
autoplot(kmfit)

  3. Write down the log-likelihood of the parametric exponential (proportional hazard) model for survival times. Explain why this model can be fit as a generalized linear model with offset.

Answer:

  • A datum of a survival data set is \((y_j, \delta_j, \mathbf{x}_j)\), where
- \(y_j\) is the survival time,
- \(\delta_j\) is the censoring indicator with \(\delta_j = 1\) if the survival time is uncensored and \(\delta_j=0\) if it is censored, and
- \(\mathbf{x}_j\) is the predictor vector.

  • Let \(y_1, \ldots, y_r\) be the uncensored observations and \(y_{r+1}, \ldots, y_n\) the censored ones. The likelihood is \[ L = \prod_{j=1}^r f(y_j) \times \prod_{j=r+1}^n S(y_j) = \prod_{j=1}^n f(y_j)^{\delta_j} S(y_j)^{1 - \delta_j}, \] so the log-likelihood is \[ \ell = \sum_{j=1}^n [\delta_j \log f(y_j) + (1 - \delta_j) \log S(y_j)] = \sum_{j=1}^n [\delta_j \log h(y_j) + \log S(y_j)], \] where \(h(y_j) = f(y_j) / S(y_j)\) is the hazard function and \(S(y_j) = 1 - F(y_j)\) is the survival function.

  • For the exponential model the hazard is constant, \(h(y_j) = \theta_j\), and the survival function is \(S(y_j) = e^{-\theta_j y_j}\), so the log-likelihood is \[ \ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log h(y_j) + \log S(y_j)] = \sum_j (\delta_j \log \theta_j - \theta_j y_j). \] Now treat the censoring indicator \(\delta_j\) as if it were Poisson\((\mu_j)\) with mean \(\mu_j = \theta_j y_j\). The Poisson log-likelihood is \[ \ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log (\theta_j y_j) - \theta_j y_j - \log \delta_j!] = \sum_j (\delta_j \log \theta_j - \theta_j y_j + \delta_j \log y_j), \] since \(\delta_j \in \{0, 1\}\) implies \(\log \delta_j! = 0\). This differs from the exponential log-likelihood only by the term \(\sum_j \delta_j \log y_j\), which does not involve the parameters, so the two likelihoods yield identical estimates. With the log link, \(\log \mu_j = \log \theta_j + \log y_j = \mathbf{x}_j^T \boldsymbol{\beta} + \log y_j\): the survival time enters the linear predictor only through the offset term \(\log y_j\). So the exponential model can be fit as a Poisson generalized linear model with offset.
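To make the offset argument concrete, here is a sketch using the chr_data frame defined above, showing that a Poisson GLM with offset \(\log y_j\) recovers the exponential proportional hazard fit:

```r
# Sketch: the exponential PH model as a Poisson GLM with offset log(time).
# The event indicator cen plays the role of the Poisson "response".
pmod <- glm(cen ~ chr + offset(log(time)), family = poisson, data = chr_data)
coef(pmod)
# These log-hazard coefficients are the negatives of the survreg() estimates
# in the next part, because survreg() parameterizes log(time) (AFT scale)
# with the scale fixed at 1.
```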

  4. Fit the exponential (proportional hazard) model on the chr data using R. Interpret the coefficients.

Answer:

emod <- survreg(Surv(time, cen) ~ chr, dist = "exponential", data = chr_data)
summary(emod)

Call:
survreg(formula = Surv(time, cen) ~ chr, data = chr_data, dist = "exponential")
             Value Std. Error     z      p
(Intercept)  2.014      0.200 10.07 <2e-16
chr         -0.350      0.283 -1.24   0.22

Scale fixed at 1 

Exponential distribution
Loglik(model)= -141.9   Loglik(intercept only)= -142.7
    Chisq= 1.52 on 1 degrees of freedom, p= 0.22 
Number of Newton-Raphson Iterations: 4 
n= 50 

Intercept = 2.014: survreg() uses the accelerated failure time parameterization, in which the coefficients act on \(\log(\text{time})\), so the baseline hazard is \(e^{-2.014} \approx 0.133\) for patients with no history of chronic disease.

Coefficient for chr = -0.350: a history of chronic disease multiplies the expected survival time by \(e^{-0.350} \approx 0.705\); equivalently, the hazard rate for patients with a history of chronic disease is \(e^{0.350} \approx 1.42\) times that of patients without, although this difference is not statistically significant (p = 0.22).

  5. Comment on the limitation of the exponential model compared to other more flexible models such as the Weibull.

Answer: The exponential model assumes that the hazard rate is constant over time, which is often unrealistic. The Weibull model is more flexible: its shape parameter allows the hazard to increase or decrease monotonically over time, and it reduces to the exponential model when the shape parameter equals 1.
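One way to check this assumption on the chr data is a likelihood ratio test of exponential versus Weibull, a sketch reusing emod and chr_data from the previous parts:

```r
# Sketch: fit a Weibull model and test it against the exponential fit (emod).
wmod <- survreg(Surv(time, cen) ~ chr, dist = "weibull", data = chr_data)
summary(wmod)  # a scale estimate far from 1 suggests a non-constant hazard
# Likelihood ratio test (exponential is Weibull with scale fixed at 1):
lrt <- 2 * (wmod$loglik[2] - emod$loglik[2])
pchisq(lrt, df = 1, lower.tail = FALSE)
```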

2 Q2 (25 pts). Longitudinal data analysis

Onychomycosis, popularly known as toenail fungus, is a fairly common condition that not only can disfigure and sometimes destroy the nail but that also can lead to social and self-image issues for sufferers. Tight-fitting shoes or hosiery, the sharing of common facilities such as showers and locker rooms, and toenail polish are all thought to be implicated in the development of onychomycosis. This question relates to data from a study conducted by researchers that recruited sufferers of a particular type of onychomycosis, dermatophyte onychomycosis. The study conducted by the researchers was focused on comparison of two oral medications, terbinafine (given as 250 mg/day, denoted as treatment 1 below) and itraconazole (given as 200 mg/day, denoted as treatment 2 below).

The trial was conducted as follows. 200 sufferers of advanced toenail dermatophyte onychomycosis in the big toe were recruited, and each saw a physician, who removed the afflicted nail. Each subject was then randomly assigned to treatment with either terbinafine (treatment 1) or itraconazole (treatment 2). Immediately prior to beginning treatment, the length of the unafflicted part of the toenail (which was hence not removed) was recorded (in millimeters). Then at 1 month, 2 months, 3 months, 6 months, and 12 months, each subject returned, and the length of the unafflicted part of the nail was measured again. A longer unafflicted nail length is a better outcome. Also recorded on each subject was gender and an indicator of the frequency with which the subject visited a gym or health club (and hence might use shared locker rooms and/or showers).

The data are available in the file toenail.txt from here. The data are presented in the form of one data record per observation; the columns of the data set are as follows:

  1. Subject id

  2. Health club frequency indicator (= 0 if once a week or less, = 1 if more than once a week)

  3. Gender indicator (= 0 if female, = 1 if male)

  4. Month

  5. Unafflicted nail length (the response, mm)

  6. Treatment indicator (= 1 if terbinafine, = 2 if itraconazole)

The researchers had several questions, which they stated to you as follows:

  1. Use the linear mixed effect model (LMM) to answer: Is there a difference in the pattern of change of lengths of the unafflicted part of the nail between subjects receiving terbinafine and itraconazole over a 12 month period? Does one treatment show results more quickly?

    • Plot the change of lengths of the unafflicted part of the nail over time and separated by treatment groups. Comment on overall patterns over time.

    • Based on the pattern observed, pick appropriate time trend in the LMM and provide an algebraic definition for your chosen LMM, e.g., is the linear trend model adequate? or quadratic trend is needed? or any other pattern is more approriate? justify your answer.

    • Model the covariance: fit both random intercept and random slope model and determine which one fits the data better.

Answer:

toenail <- read.table("toenail.txt", header = FALSE)
colnames(toenail) <- c("id", "health", "gender", "month", "length", "treatment")
toenail$health <- factor(toenail$health, levels = c(0, 1), labels = c("Low", "High"))
toenail$gender <- factor(toenail$gender, levels = c(0, 1), labels = c("Female", 
                                                                      "Male"))
toenail$treatment <- factor(toenail$treatment, levels = c(1, 2), 
                            labels = c("Terbinafine", "Itraconazole"))
library(ggplot2)
ggplot(toenail, aes(x = month, y = length, group = id, color = treatment)) + 
  geom_line() +
  facet_wrap(~treatment) +
  labs(x = "Month", y = "Length (mm)")

It seems that the length of the unafflicted part of the nail increases over time under both treatments, and the increase appears faster for itraconazole than for terbinafine.

From the plots above, the linear trend model appears adequate: the length of the unafflicted part of the nail increases roughly linearly over time for both treatments. For this part we include only treatment and month as fixed effects (gender and health club frequency are examined in Q2.2). We can use the Kenward-Roger adjusted F-test on the fixed effects to check whether a quadratic trend improves on the linear trend.

library(pbkrtest)

mmod <- lmer(length ~ month + treatment + (month | id), 
             data = toenail, REML = TRUE)
mmodr <- lmer(length ~ poly(month, 2) + treatment + 
                (month | id), data = toenail, REML = TRUE)
KRmodcomp(mmod, mmodr)
large : length ~ poly(month, 2) + treatment + (month | id)
small : length ~ month + treatment + (month | id)
          stat      ndf      ddf F.scaling p.value
Ftest   0.7116   1.0000 799.0000         1  0.3992

The p-value is 0.3992, so the quadratic trend model is not significantly better than the linear trend model; the linear trend is adequate. Next we test the interaction between treatment and month.

mmod <- lmer(length ~ month + treatment + (month | id), 
             data = toenail, REML = TRUE)
mmodr <- lmer(length ~ month * treatment + (month | id), data = toenail, 
              REML = TRUE)
KRmodcomp(mmod, mmodr)
large : length ~ month * treatment + (month | id)
small : length ~ month + treatment + (month | id)
          stat      ndf      ddf F.scaling  p.value   
Ftest   9.2913   1.0000 198.0000         1 0.002617 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.002617, so the interaction between treatment and month is significant. We then consider the linear mixed model with random intercepts and random slopes: \[ \text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{treatment}_{j} + \beta_3 \text{month}_{i} \times \text{treatment}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij}, \] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[ \begin{pmatrix} b_{0,j} \\ b_{1,j} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).

library(lme4)
mmod <- lmer(length ~ month * treatment + (month | id), 
             data = toenail)
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month * treatment + (month | id)
   Data: toenail

REML criterion at convergence: 4365.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6232 -0.5077  0.0249  0.4775  2.9321 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 2.2121   1.4873        
          month       0.1104   0.3323   -0.38
 Residual             0.9412   0.9701        
Number of obs: 1200, groups:  id, 200

Fixed effects:
                            Estimate Std. Error t value
(Intercept)                  5.34137    0.15883  33.630
month                        0.27149    0.03464   7.837
treatmentItraconazole        0.10069    0.22462   0.448
month:treatmentItraconazole  0.14933    0.04899   3.048

Correlation of Fixed Effects:
            (Intr) month  trtmnI
month       -0.414              
trtmntItrcn -0.707  0.293       
mnth:trtmnI  0.293 -0.707 -0.414
confint(mmod, method = "boot")
                                  2.5 %     97.5 %
.sig01                       1.31573131  1.6584573
.sig02                      -0.51414879 -0.2308480
.sig03                       0.29697376  0.3652581
.sigma                       0.92457234  1.0215266
(Intercept)                  5.04644844  5.6657562
month                        0.20336329  0.3375551
treatmentItraconazole       -0.33843082  0.5114830
month:treatmentItraconazole  0.06682768  0.2444718

.sig01 is the standard deviation of the random intercepts, .sig02 is the correlation between the random intercepts and slopes, and .sig03 is the standard deviation of the random slopes. The bootstrap intervals for both standard deviations are well away from 0, supporting the random intercept and random slope terms. The interval for the treatment main effect includes 0, so the two groups do not differ detectably at baseline, but the interval for the month-by-treatment interaction (0.067 to 0.244) excludes 0: the unafflicted nail length grows significantly faster in the itraconazole group over the 12-month period.

  2. Use the linear mixed effect model (LMM) to answer: Is there an association between the pattern of change of nail lengths and gender and/or health club frequency in subjects taking terbinafine? This might indicate that this drug brings about relief more swiftly in some kinds of subject versus others.

    • Provide graphs to show patterns the change of nail lengths and gender and/or health club frequency in subjects taking terbinafine.

    • Based on the pattern observed from question 1, pick appropriate time trend in the LMM and provide an algebraic definition for your chosen LMM, e.g., is the linear trend model adequate? or quadratic trend is needed? or any other pattern is more approriate? justify your answer.

    • Model the covariance: fit both random intercept and random slope model and determine which one fits the data better.

Answer:

  • association between length and gender
toenail |>
  filter(treatment == "Terbinafine") |>
  ggplot(aes(x = month, y = length, group = id, color = gender)) + 
  geom_line() +
  facet_wrap(~gender) +
  labs(x = "Month", y = "Length (mm)")

It seems that the length of the unafflicted part of the nail increases over time for both gender groups among subjects taking terbinafine. The lengths for females appear slightly higher than for males.

From the plots above, the linear trend model appears adequate: the length of the unafflicted part of the nail increases roughly linearly over time for both genders. For this part we restrict to subjects taking terbinafine and include only gender and month as fixed effects. We can use the Kenward-Roger adjusted F-test to check whether a quadratic trend improves on the linear trend.

mmod <- lmer(length ~ month + gender + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
mmodr <- lmer(length ~ poly(month, 2) + gender + 
                (month | id), data = filter(toenail, treatment == "Terbinafine"), 
              REML = TRUE)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00234707 (tol = 0.002, component 1)
KRmodcomp(mmod, mmodr)
large : length ~ poly(month, 2) + gender + (month | id)
small : length ~ month + gender + (month | id)
          stat      ndf      ddf F.scaling p.value
Ftest   0.2822   1.0000 399.0000         1  0.5956

The p-value is 0.5956, so the quadratic trend model is not significantly better than the linear trend model; the linear trend is adequate for describing the change of nail lengths by gender. Next we test the interaction between gender and month.

mmod <- lmer(length ~ month + gender + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
mmodr <- lmer(length ~ month * gender + (month | id), 
              data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
KRmodcomp(mmod, mmodr)
large : length ~ month * gender + (month | id)
small : length ~ month + gender + (month | id)
         stat     ndf     ddf F.scaling p.value
Ftest  0.8585  1.0000 98.0000         1  0.3564

The p-value is 0.3564, so the interaction between month and gender is not significant, and we do not include it in the model. We then consider the linear mixed model with random intercepts and random slopes: \[ \text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{gender}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij}, \] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[ \begin{pmatrix} b_{0,j} \\ b_{1,j} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).

library(lme4)
mmod <- lmer(length ~ month + gender + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"))
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month + gender + (month | id)
   Data: filter(toenail, treatment == "Terbinafine")

REML criterion at convergence: 2122.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.76246 -0.51568  0.02272  0.48454  2.76942 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 id       (Intercept) 2.18841  1.4793       
          month       0.04666  0.2160   0.17
 Residual             0.93674  0.9679       
Number of obs: 600, groups:  id, 100

Fixed effects:
            Estimate Std. Error t value
(Intercept)  5.17497    0.22568  22.931
month        0.27149    0.02371  11.450
genderMale   0.32629    0.31589   1.033

Correlation of Fixed Effects:
           (Intr) month 
month       0.027       
genderMale -0.714  0.000
confint(mmod, method = "boot")
                  2.5 %    97.5 %
.sig01       1.25119307 1.7101695
.sig02      -0.05177107 0.4091293
.sig03       0.17783459 0.2506884
.sigma       0.90441382 1.0305841
(Intercept)  4.75913245 5.6038947
month        0.22512976 0.3182264
genderMale  -0.31418434 1.0168516

.sig01 is the standard deviation of the random intercepts, .sig02 is the correlation between the random intercepts and slopes, and .sig03 is the standard deviation of the random slopes. The bootstrap intervals for both standard deviations exclude 0, supporting the random intercept and random slope terms. The confidence interval for genderMale includes 0, so the male group is not significantly different from the female group.

  • association between length and health club frequency
toenail |>
  filter(treatment == "Terbinafine") |>
  ggplot(aes(x = month, y = length, group = id, color = health)) + 
  geom_line() +
  facet_wrap(~health) +
  labs(x = "Month", y = "Length (mm)")

It seems that the length of the unafflicted part of the nail increases over time for both health club frequency groups among subjects taking terbinafine. The lengths for the low-frequency group appear higher than for the high-frequency group.

From the plots above, the linear trend model appears adequate: the length of the unafflicted part of the nail increases roughly linearly over time for both health club frequency groups. For this part we restrict to subjects taking terbinafine and include only health and month as fixed effects. We can use the Kenward-Roger adjusted F-test to check whether a quadratic trend improves on the linear trend.

mmod <- lmer(length ~ month + health + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
mmodr <- lmer(length ~ poly(month, 2) + health + 
                (month | id), data = filter(toenail, treatment == "Terbinafine"), 
              REML = TRUE)
KRmodcomp(mmod, mmodr)
large : length ~ poly(month, 2) + health + (month | id)
small : length ~ month + health + (month | id)
          stat      ndf      ddf F.scaling p.value
Ftest   0.2822   1.0000 399.0000         1  0.5956

The p-value is 0.5956, so the quadratic trend model is not significantly better than the linear trend model; the linear trend is adequate for describing the change of nail lengths by health club frequency. Next we test the interaction between health and month.

mmod <- lmer(length ~ month + health + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
mmodr <- lmer(length ~ month * health + (month | id), 
              data = filter(toenail, treatment == "Terbinafine"), REML = TRUE)
KRmodcomp(mmod, mmodr)
large : length ~ month * health + (month | id)
small : length ~ month + health + (month | id)
         stat     ndf     ddf F.scaling p.value
Ftest  0.0066  1.0000 98.0000         1  0.9356

The p-value is 0.9356, so the interaction between month and health is not significant, and we do not include it in the model. We then consider the linear mixed model with random intercepts and random slopes: \[ \text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{health}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij}, \] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[ \begin{pmatrix} b_{0,j} \\ b_{1,j} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).

library(lme4)
mmod <- lmer(length ~ month + health + (month | id), 
             data = filter(toenail, treatment == "Terbinafine"))
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month + health + (month | id)
   Data: filter(toenail, treatment == "Terbinafine")

REML criterion at convergence: 2110.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.69831 -0.51211  0.03935  0.49881  2.83201 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 id       (Intercept) 1.89097  1.3751       
          month       0.04666  0.2160   0.17
 Residual             0.93674  0.9679       
Number of obs: 600, groups:  id, 100

Fixed effects:
            Estimate Std. Error t value
(Intercept)  6.14331    0.25810  23.802
month        0.27149    0.02371  11.450
healthHigh  -1.19692    0.31526  -3.797

Correlation of Fixed Effects:
           (Intr) month 
month       0.020       
healthHigh -0.818  0.000
confint(mmod, method = "boot")
                  2.5 %     97.5 %
.sig01       1.15800218  1.5890392
.sig02      -0.04891836  0.4293030
.sig03       0.17618747  0.2506294
.sigma       0.89640723  1.0371711
(Intercept)  5.62966473  6.6366288
month        0.22470756  0.3179620
healthHigh  -1.87821269 -0.5581067

.sig01 is the standard deviation of the random intercepts, .sig02 is the correlation between the random intercepts and slopes, and .sig03 is the standard deviation of the random slopes. The bootstrap intervals for both standard deviations exclude 0, supporting the random intercept and random slope terms. The high-frequency group is significantly different from the low-frequency group because the confidence interval for healthHigh (-1.88 to -0.56) does not include 0.

  3. In answering these scientific questions of interest, clearly write out the analytic models you consider for answering these questions (as detailed in the sub-questions). Clearly outline your decision making process for how you selected your final models. Fit your chosen final models and report to the project investigators on the stated scientific questions of interest.

Answer: From Q2.1, we see that the linear trend model is adequate and the interaction between treatment and month is needed. From the t-values of the linear mixed model, month and the interaction term are significant but the main effect of treatment is not; however, to keep the model hierarchically well formulated, we retain treatment. Checking the confidence intervals of the random intercept and random slope standard deviations, both are clearly nonzero.

From Q2.2, we see that both the month-gender and month-health trends are linear and that neither interaction is significant. From the t-values of the linear mixed models, gender is not significant in the month-gender model, while month and health are significant in the month-health model. Checking the confidence intervals of the random intercept and random slope standard deviations in both models, both are clearly nonzero.

So we consider the linear mixed model with random intercepts and random slopes. The final model is \[ \text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{treatment}_{j} + \beta_3 \text{month}_{i} \times \text{treatment}_{j}+\beta_4 \text{health}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij}, \] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[ \begin{pmatrix} b_{0,j} \\ b_{1,j} \end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}). \] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).

mmod <- lmer(length ~ month * treatment + health + (month | id), 
             data = toenail)
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month * treatment + health + (month | id)
   Data: toenail

REML criterion at convergence: 4340.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.58679 -0.51413  0.02595  0.47767  2.92726 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 2.0600   1.4353        
          month       0.1104   0.3322   -0.45
 Residual             0.9412   0.9701        
Number of obs: 1200, groups:  id, 200

Fixed effects:
                            Estimate Std. Error t value
(Intercept)                  6.06663    0.20557  29.511
month                        0.27149    0.03464   7.838
treatmentItraconazole        0.08987    0.21775   0.413
healthHigh                  -1.08247    0.20331  -5.324
month:treatmentItraconazole  0.14933    0.04899   3.048

Correlation of Fixed Effects:
            (Intr) month  trtmnI hlthHg
month       -0.354                     
trtmntItrcn -0.536  0.334              
healthHigh  -0.663  0.000  0.009       
mnth:trtmnI  0.250 -0.707 -0.472  0.000

We can report that the length of the unafflicted part of the nail increases linearly over time in all groups. Nail length is significantly lower in the high health club frequency group than in the low-frequency group, and the unafflicted length grows significantly faster under itraconazole than under terbinafine. These results suggest that faster regrowth of the unafflicted nail is associated with itraconazole treatment and with lower health club frequency, though the latter is an observational association rather than a randomized comparison.
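To translate the fitted model into quantities the investigators can use directly, population-level predictions can be sketched with predict(), where re.form = NA sets the random effects to zero and mmod is the final model above:

```r
# Sketch: population-level predicted nail lengths at 0 and 12 months
# for each treatment-by-health combination (random effects dropped).
newdat <- expand.grid(month = c(0, 12),
                      treatment = levels(toenail$treatment),
                      health = levels(toenail$health))
newdat$pred <- predict(mmod, newdata = newdat, re.form = NA)
newdat
```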

3 Q3 (25 pts). GEE and GLMM

The Skin Cancer Prevention Study, a randomized, double-blind, placebo-controlled clinical trial, was designed to test the effectiveness of beta-carotene in the prevention of non-melanoma skin cancer in high-risk subjects. A total of 1,683 subjects were randomized to either placebo or 50mg of beta-carotene per day and were followed for up to 5 years. Subjects were examined once per year and biopsied if a cancer was suspected to determine the number of new cancers per year. The outcome variable, \(Y\), is a count of the number of new skin cancers per year. You may assume that the counts of new skin cancers, \(Y\), are from exact one-year periods (so that no offset term is needed).

Selected data from the study are in the dataset called skin.txt and is available here. Each row of the dataset contains the following 9 variables: ID, Center, Age, Skin, Gender, Exposure, \(Y\), Treatment, Year. These variables take values as follows:

Variable
ID: Subject identifier number
Center: Identifier number for center of enrollment
Age: Subject’s age in years at randomization
Skin: Skin type (1=burns; 0 otherwise) [evaluated at randomization and doesn’t change with time]
Gender: 1=male; 0=female
Exposure: Count of number of previous skin cancers [prior to randomization]
\(Y\): Count of number of new skin cancers in the Year of follow-up
Treatment: 1=beta-carotene; 0=placebo
Year: Year of follow-up after starting randomized treatment

Your collaborator is interested in assessing the effect of treatment on the incidence of new skin cancers over time. As the statistician on the project, provide an analysis of the data that addresses this question. Specifically, the investigator at Center=1 is interested in characterizing the distribution of risk among subjects at her center. In the following, only include the subset of subjects with Center=1 in the analysis.

skin <- read.table("skin.txt", header = FALSE)
colnames(skin) <- c("ID", "Center", "Age", "Skin", "Gender", "Exposure", "Y", 
                    "Treatment", "Year")
skin <- skin |>
  filter(Center == 1) |>
  mutate(Skin = factor(Skin, levels = c(0, 1), labels = c("otherwise", "burns")),
         Gender = factor(Gender, levels = c(0, 1), labels = c("female", "male")),
         Treatment = factor(Treatment, levels = c(0, 1), 
                            labels = c("placebo", "beta-carotene")))
skin |>
  head()
      ID Center Age  Skin Gender Exposure Y Treatment Year
1 100034      1  51 burns   male        4 0   placebo    1
2 100034      1  51 burns   male        4 1   placebo    2
3 100034      1  51 burns   male        4 1   placebo    3
4 100034      1  51 burns   male        4 1   placebo    4
5 100034      1  51 burns   male        4 0   placebo    5
6 100045      1  68 burns female        2 0   placebo    1
  1. Provide an algebraic definition for a generalized linear marginal model (GEE) in which the only effects are for the intercept and Year (as a continuous variable). Fit this model and provide a table which includes the estimates of the parameters in your model.

Answer: The GEE model is defined as follows: Let \(\mathbf{Y}_i\) be the response vector of the \(i\)-th subject. \[ \begin{eqnarray*} \mathbb{E} \mathbf{Y}_i &=& \boldsymbol{\mu}_i \\ g(\mu_{ij}) &=& \log(\mu_{ij}) = \beta_0 + \beta_1 \text{Year}_{ij} \\ \mathbf{V}_i &=& \text{Var}(\mathbf{Y}_i) = \phi \mathbf{A}_i^{1/2} \mathbf{R}_i(\boldsymbol{\alpha}) \mathbf{A}_i^{1/2}, \end{eqnarray*} \] where \(g\) is the log link (appropriate for counts), \(\mathbf{A}_i = \text{diag}(a(\boldsymbol{\mu}_i))\) captures the individual variances, and \(\mathbf{R}_i(\boldsymbol{\alpha})\) is a working correlation matrix. Given estimates of \(\phi\) and \(\boldsymbol{\alpha}\), we solve the estimating equations \[ \sum_i [D_{\boldsymbol{\beta}} \boldsymbol{\mu}_i(\boldsymbol{\beta})]^T \mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}_i) = \mathbf{0}. \]

library(geepack)
library(gtsummary)
library(kableExtra)
mmod <- geeglm(Y ~ Year, id = ID, data = skin, family = poisson, 
               scale.fix = TRUE, corstr = "exchangeable")
summary(mmod)

Call:
geeglm(formula = Y ~ Year, family = poisson, data = skin, id = ID, 
    corstr = "exchangeable", scale.fix = TRUE)

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)    
(Intercept)  -1.5411  0.1674 84.739   <2e-16 ***
Year         -0.1066  0.0486  4.807   0.0283 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Scale is fixed.

  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.2306 0.08122
Number of clusters:   422  Maximum cluster size: 5 
tibble(Intercept = coef(mmod)[1], Year = coef(mmod)[2], alpha = 0.231) |>
  kable("html") |>
  kable_styling("striped", full_width = F)
Intercept Year alpha
-1.541 -0.1066 0.231
  2. Provide an algebraic definition for a generalized linear mixed model (GLMM) in which the only fixed effects are for the intercept and Year (as a continuous variable), and the only random effect is the intercept. What is being assumed about how the distribution of risk among subjects changes with time?

Answer: The GLMM is defined as follows: In a GLM, the distribution of \(Y_i\) is from the exponential family of distributions of the form \[ f(y_i \mid \theta_i, \phi) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]. \] If we use the canonical link, then \[ \theta_i = g(\mu_i) = \eta_i, \] where \(\mu_i = \mathbb{E} Y_i\). Now let \[ \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} + \mathbf{z}_i^T \boldsymbol{\gamma}, \] where \(\boldsymbol{\beta}\) are the fixed effects and \(\boldsymbol{\gamma}\) are the random effects with density \(h(\boldsymbol{\gamma} \mid \boldsymbol{\Sigma})\) (typically multivariate normal with mean \(\mathbf{0}\) and covariance \(\boldsymbol{\Sigma}\)). Then the likelihood is \[ L(\boldsymbol{\beta}, \phi, \boldsymbol{\Sigma}) = \prod_{i=1}^n \int f(y_i \mid \boldsymbol{\beta}, \phi, \boldsymbol{\gamma}) h(\boldsymbol{\gamma} \mid \boldsymbol{\Sigma}) \, d \boldsymbol{\gamma} \] and the log-likelihood is \[ \ell(\boldsymbol{\beta}, \phi, \boldsymbol{\Sigma}) = \sum_{i=1}^n \log \int f(y_i \mid \boldsymbol{\beta}, \phi, \boldsymbol{\gamma}) h(\boldsymbol{\gamma} \mid \boldsymbol{\Sigma}) \, d \boldsymbol{\gamma}. \] This model, combining a GLM with mixed effects, is called the generalized linear mixed model (GLMM). In this question, letting \(Y_{ij}\) be the count of new skin cancers for subject \(i\) in year \(j\), \[ Y_{ij} \sim \text{Poisson}(\mu_{ij}), \quad \log(\mu_{ij}) = \beta_0 + \beta_1 \text{Year}_{ij} + \gamma_{0i}, \quad \gamma_{0i} \sim \text{Normal}(0, \sigma^2). \] It is assumed that each subject's yearly count of new skin cancers is Poisson with a log mean that is linear in Year plus a subject-specific intercept \(\gamma_{0i}\). Because \(\gamma_{0i}\) does not depend on time, the model assumes every subject's risk changes by the same multiplicative factor \(e^{\beta_1}\) per year: the distribution of log risk among subjects shifts by \(\beta_1\) each year while its spread, \(\sigma^2\), stays constant over time.
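The per-subject integral in this likelihood can be evaluated numerically. A minimal base-R sketch for one subject under the random-intercept Poisson model, where the parameter values and counts below are illustrative assumptions rather than fitted estimates:

```r
# Sketch: one subject's GLMM likelihood contribution, integrating the
# Poisson likelihood over the Normal(0, sigma^2) random intercept.
# Parameter values and counts are illustrative, not the fitted estimates.
b0 <- -2.3; b1 <- -0.11; sigma <- 1.25
y    <- c(0, 1, 1, 0, 0)   # a hypothetical subject's yearly counts
year <- 1:5

integrand <- Vectorize(function(g) {
  prod(dpois(y, lambda = exp(b0 + b1 * year + g))) * dnorm(g, 0, sigma)
})
lik_i <- integrate(integrand, lower = -Inf, upper = Inf)$value
lik_i  # a single likelihood contribution in (0, 1)
```

In practice glmer approximates this integral with the Laplace approximation (or adaptive Gauss-Hermite quadrature via its nAGQ argument) rather than one-dimensional quadrature per subject.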

  3. Fit your chosen GLMM and provide a table from your output which includes the estimates for the parameters in your GLMM, and provide careful interpretation of the Year term.

Answer:

mmod1 <- glmer(Y ~ Year + (1 | ID), data = skin, family = poisson)
summary(mmod1)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: poisson  ( log )
Formula: Y ~ Year + (1 | ID)
   Data: skin

     AIC      BIC   logLik deviance df.resid 
  1641.7   1658.3   -817.8   1635.7     1880 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-1.477 -0.260 -0.234 -0.210  3.895 

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 1.57     1.25    
Number of obs: 1883, groups:  ID, 422

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.2988     0.1756  -13.09   <2e-16 ***
Year         -0.1083     0.0434   -2.49    0.013 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
Year -0.647
tibble(Intercept = fixef(mmod1)[1], Year = fixef(mmod1)[2]) |>
  kable("html") |>
  kable_styling("striped", full_width = F)
Intercept Year
-2.299 -0.1083

The coefficient of Year is -0.108: for a given subject at center 1 (holding the random intercept fixed), each additional year of follow-up multiplies the estimated expected number of new skin cancers by \(e^{-0.1083} \approx 0.897\), i.e. a decrease of \(1 - e^{-0.1083} \approx 10.3\%\) per year.
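The rate-ratio arithmetic can be checked directly in R (the coefficient value is taken from the fit above):

```r
# Rate-ratio interpretation of the Year coefficient from the GLMM fit
beta_year <- -0.1083
rate_ratio   <- exp(beta_year)   # multiplicative change per year, ~0.897
pct_decrease <- 1 - rate_ratio   # ~0.103, i.e. about a 10.3% decrease
round(c(rate_ratio, pct_decrease), 3)
```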

  4. Are the estimates for the fixed intercept terms the same or different in the GLMM compared with the marginal model fitted in question (1)? Why are they the same or different?

Answer: The estimates for the fixed intercept terms are different in the GLMM compared with the marginal model fitted in question (1). The fixed intercept term in the GLMM is -2.2988, while the fixed intercept term in the marginal model is -1.541.

The fixed intercept in the GEE is a population-averaged quantity: the log of the expected number of new skin cancers averaged over all subjects at center 1. The fixed intercept in the GLMM is subject-specific: the log of the expected count for a typical subject at center 1, i.e. one with random intercept \(\gamma_{0i} = 0\). Because the log link is nonlinear, averaging over the random effects does not recover the typical-subject value; with a normal random intercept, \(\mathbb{E}[e^{\gamma_{0i}}] = e^{\sigma^2/2} > 1\), so the marginal (GEE) intercept is larger than the conditional (GLMM) intercept on the log scale.
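The relationship between the two intercepts can be checked numerically: under a normal random intercept, adding \(\sigma^2/2\) to the GLMM intercept should land near the GEE intercept. A rough check using the estimates printed above:

```r
# Conditional (GLMM) vs marginal (GEE) intercepts, estimates from above
b0_glmm <- -2.2988   # typical-subject intercept
sigma2  <- 1.57      # random-intercept variance
b0_gee  <- -1.5411   # population-averaged intercept

b0_glmm + sigma2 / 2  # implied marginal intercept, ~ -1.51
# close to the GEE estimate of -1.5411; the small gap reflects the
# normality assumption and the differing estimation procedures
```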

  5. Use the parameter estimates from your GLMM and your model definition to characterize the distribution of expected counts of new skin cancers among subjects at center 1 during their first year of follow-up.

Answer: For the first year of follow-up (Year = 1), \[ \begin{aligned} \log(\mu_i) &= -2.2988 - 0.1083 \times 1 + \gamma_{0i} = -2.4071 + \gamma_{0i}, \\ \gamma_{0i} &\sim \text{Normal}(0, 1.57), \end{aligned} \] so across subjects at center 1, \(\log(\mu_i) \sim \text{Normal}(-2.4071, 1.57)\) and \(\mu_i \sim \text{Log-Normal}(-2.4071, 1.57)\).
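A few summaries of this lognormal risk distribution, computed from the estimates above (note the 1.57 is a variance, so the lognormal sdlog is its square root):

```r
# Distribution of expected first-year counts among subjects at center 1:
# mu ~ Log-Normal(meanlog = -2.4071, sdlog = sqrt(1.57))
mu_log <- -2.4071
sd_log <- sqrt(1.57)

median_mu <- exp(mu_log)                 # median subject: ~0.09 new cancers
mean_mu   <- exp(mu_log + sd_log^2 / 2)  # mean ~0.20, pulled up by right skew
mid95     <- qlnorm(c(0.025, 0.975), meanlog = mu_log, sdlog = sd_log)
round(c(median = median_mu, mean = mean_mu, mid95), 3)
```

The heavy right skew means a typical subject expects far fewer than one new cancer per year, while a small fraction of subjects carry much higher risk.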

4 Q4. (25 pts) LMM and GAMM

This question is adapted from Exercise 11.2 of ELMR (p251). Read the documentation of the dataset hprice in Faraway package before working on this problem.

  1. Make a plot of the data on a single panel to show how housing prices increase by year. Describe what can be seen in the plot.

Answer:

data(hprice)
ggplot(hprice, aes(x = time, y = narsp, group = msa)) + 
  geom_line() +
  labs(x = "Year", y = "Natural Log of House Price")

From the plot, house prices in most MSAs increase over time. For most MSAs the natural log of the average house price (prices in thousands of dollars) lies between 4 and 4.5, and these MSAs have roughly the same slope. A few MSAs have log average prices around 5, again with similar slopes.

  2. Fit a linear model with the (log) house price as the response and all other variables (except msa) as fixed effect predictors. Which terms are statistically significant? Discuss the coefficient for time.

Answer:

mod <- lm(narsp ~ . - msa, data = hprice)
summary(mod)

Call:
lm(formula = narsp ~ . - msa, data = hprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3139 -0.1081 -0.0152  0.0855  0.5559 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.67e+00   8.46e-02   31.61  < 2e-16 ***
ypc          7.03e-05   4.36e-06   16.13  < 2e-16 ***
perypc      -1.37e-02   5.07e-03   -2.70  0.00722 ** 
regtest      2.95e-02   3.10e-03    9.52  < 2e-16 ***
rcdum1       1.49e-01   3.24e-02    4.60  6.1e-06 ***
ajwtr1       3.59e-02   2.00e-02    1.80  0.07348 .  
time        -1.77e-02   5.13e-03   -3.45  0.00065 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.165 on 317 degrees of freedom
Multiple R-squared:  0.757, Adjusted R-squared:  0.753 
F-statistic:  165 on 6 and 317 DF,  p-value: <2e-16

The intercept, ypc, perypc, regtest, rcdum, and time are statistically significant. The coefficient for time is -1.77e-02: holding the other variables fixed, each additional year is associated with an estimated \(1-e^{-0.0177} \approx 1.75\%\) decrease in average house price. The negative sign is surprising given that the first plot shows prices rising over time; it likely reflects the fact that ypc also grows steadily with time, so time and per-capita income are strongly correlated in this model.

  3. Make a plot that shows how per-capita income changes over time. What is the nature of the increase? Make a similar plot to show how income growth changes over time. Comment on the plot.

Answer:

ggplot(hprice, aes(x = time, y = ypc, group = msa)) + 
  geom_line() +
  labs(x = "Year", y = "Per Capita Income")

From the plot, per capita income increases over time in every MSA, and the increase appears roughly linear.

ggplot(hprice, aes(x = time, y = perypc, group = msa)) + 
  geom_line() +
  labs(x = "Year", y = "Percentage Growth in Per Capita Income")

From the plot, the percentage growth in per capita income is not linear over time, but the MSAs follow a similar pattern: growth dips in 1987 and then rises to a peak in 1989. After 1989, growth declines in most MSAs (sharply in some, slightly in others) and rises slightly in a few. By the end of the period, the growth rates converge to around 4 percent.

  4. Create a new variable that is the per-capita income for the first time period for each MSA. Refit the same linear model but now using the initial income and not the income as it changes over time. Compare the two models.

Answer:

hprice <- hprice |>
  group_by(msa) |> 
  mutate(ypc_initial = first(ypc))
mod2 <- lm(narsp ~ . - msa - ypc + ypc_initial, data = hprice)
summary(mod2)

Call:
lm(formula = narsp ~ . - msa - ypc + ypc_initial, data = hprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3728 -0.1055 -0.0154  0.0790  0.5391 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.30e+00   9.67e-02   23.80  < 2e-16 ***
perypc      -8.13e-03   4.97e-03   -1.64     0.10    
regtest      3.01e-02   3.04e-03    9.93  < 2e-16 ***
rcdum1       1.51e-01   3.16e-02    4.78  2.7e-06 ***
ajwtr1       3.85e-02   1.96e-02    1.97     0.05 .  
time         3.71e-02   3.79e-03    9.80  < 2e-16 ***
ypc_initial  8.87e-05   5.28e-06   16.81  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.162 on 317 degrees of freedom
Multiple R-squared:  0.766, Adjusted R-squared:  0.762 
F-statistic:  173 on 6 and 317 DF,  p-value: <2e-16
summary(mod)

Call:
lm(formula = narsp ~ . - msa, data = hprice)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3139 -0.1081 -0.0152  0.0855  0.5559 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.67e+00   8.46e-02   31.61  < 2e-16 ***
ypc          7.03e-05   4.36e-06   16.13  < 2e-16 ***
perypc      -1.37e-02   5.07e-03   -2.70  0.00722 ** 
regtest      2.95e-02   3.10e-03    9.52  < 2e-16 ***
rcdum1       1.49e-01   3.24e-02    4.60  6.1e-06 ***
ajwtr1       3.59e-02   2.00e-02    1.80  0.07348 .  
time        -1.77e-02   5.13e-03   -3.45  0.00065 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.165 on 317 degrees of freedom
Multiple R-squared:  0.757, Adjusted R-squared:  0.753 
F-statistic:  165 on 6 and 317 DF,  p-value: <2e-16

In the new model, the initial per-capita income ypc_initial is also statistically significant. The coefficient of time becomes positive, whereas in the previous model it was negative; the positive sign is now consistent with the rising prices seen in the first plot. The adjusted R-squared of the new model is 0.762, slightly higher than the previous model's 0.753.

  5. Fit a mixed effects model that has a random intercept for each MSA. Why might this be reasonable? The rest of the model should have the same structure as in the previous question. Make a numerical interpretation of the coefficient of time in your model. Explain the difference between REML and MLE methods.

Answer:

mmod <- lmer(narsp ~ . - msa - ypc + ypc_initial + (1|msa), data = hprice)
Warning: Some predictor variables are on very different scales: consider
rescaling
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: narsp ~ . - msa - ypc + ypc_initial + (1 | msa)
   Data: hprice

REML criterion at convergence: -581.3

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-2.385 -0.594 -0.025  0.574  2.858 

Random effects:
 Groups   Name        Variance Std.Dev.
 msa      (Intercept) 0.02361  0.1537  
 Residual             0.00545  0.0738  
Number of obs: 324, groups:  msa, 36

Fixed effects:
             Estimate Std. Error t value
(Intercept)  2.31e+00   2.63e-01    8.78
perypc      -9.15e-03   2.30e-03   -3.98
regtest      3.02e-02   8.75e-03    3.45
rcdum1       1.51e-01   9.11e-02    1.66
ajwtr1       3.85e-02   5.65e-02    0.68
time         3.68e-02   1.73e-03   21.28
ypc_initial  8.86e-05   1.52e-05    5.83

Correlation of Fixed Effects:
            (Intr) perypc regtst rcdum1 ajwtr1 time  
perypc      -0.049                                   
regtest     -0.504 -0.005                            
rcdum1       0.478 -0.004 -0.310                     
ajwtr1       0.110  0.001 -0.040 -0.178              
time        -0.047  0.395 -0.002 -0.002  0.000       
ypc_initial -0.751  0.002 -0.174 -0.332 -0.182  0.001
fit warnings:
Some predictor variables are on very different scales: consider rescaling

A random intercept for each MSA is reasonable because we have repeated yearly observations on the same MSAs: prices within an MSA are correlated over time, and unmeasured MSA-level characteristics shift each MSA's baseline price. The coefficient of time is 3.68e-02: for every additional year, the average house price is estimated to increase by \(e^{0.0368}-1 \approx 3.75\%\), controlling for all other variables in the model.

Maximum likelihood (ML) estimates of the variance components are biased downward because they ignore the degrees of freedom used in estimating the fixed effects. Restricted maximum likelihood (REML) reduces this bias: the variance components are estimated by maximizing the likelihood of contrasts of the data chosen to be free of the fixed effects, and the fixed effects are then estimated by generalized least squares. The REML estimates do not depend on which set of such contrasts is chosen.
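The downward bias of ML can be seen on simulated data. A small sketch (the group structure and variance values below are illustrative assumptions, chosen to resemble the 36-MSA layout):

```r
library(lme4)

# Simulated balanced one-way layout: 36 groups of 9 observations,
# true group-level SD = 0.15, residual SD = 0.07 (illustrative values)
set.seed(1)
g <- factor(rep(1:36, each = 9))
y <- 2 + rnorm(36, sd = 0.15)[g] + rnorm(length(g), sd = 0.07)
d <- data.frame(y, g)

fit_reml <- lmer(y ~ 1 + (1 | g), data = d, REML = TRUE)
fit_ml   <- lmer(y ~ 1 + (1 | g), data = d, REML = FALSE)

sd_reml <- as.data.frame(VarCorr(fit_reml))$sdcor[1]
sd_ml   <- as.data.frame(VarCorr(fit_ml))$sdcor[1]
c(REML = sd_reml, ML = sd_ml)  # the ML group-level SD is the smaller one
```

The gap between the two estimates grows as the number of fixed-effect parameters increases relative to the number of groups.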

  6. Fit a model that omits the adjacent to water and rent control predictors. Test whether this reduction in the model can be supported.

Answer:

mmod2 <- lmer(narsp ~ . - msa - ypc + ypc_initial + (1|msa) - ajwtr - rcdum, 
             data = hprice)
Warning: Some predictor variables are on very different scales: consider
rescaling
summary(mmod2)
Linear mixed model fit by REML ['lmerMod']
Formula: narsp ~ . - msa - ypc + ypc_initial + (1 | msa) - ajwtr - rcdum
   Data: hprice

REML criterion at convergence: -584.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3793 -0.5958 -0.0131  0.5801  2.8500 

Random effects:
 Groups   Name        Variance Std.Dev.
 msa      (Intercept) 0.02489  0.1578  
 Residual             0.00545  0.0738  
Number of obs: 324, groups:  msa, 36

Fixed effects:
             Estimate Std. Error t value
(Intercept)  2.05e+00   2.31e-01    8.87
perypc      -9.13e-03   2.30e-03   -3.97
regtest      3.55e-02   8.49e-03    4.18
time         3.68e-02   1.73e-03   21.28
ypc_initial  1.01e-04   1.42e-05    7.08

Correlation of Fixed Effects:
            (Intr) perypc regtst time  
perypc      -0.053                     
regtest     -0.416 -0.006              
time        -0.053  0.395 -0.002       
ypc_initial -0.698  0.000 -0.349  0.000
fit warnings:
Some predictor variables are on very different scales: consider rescaling
KRmodcomp(mmod, mmod2)
large : narsp ~ . - msa - ypc + ypc_initial + (1 | msa)
small : narsp ~ . - msa - ypc + ypc_initial + (1 | msa) - ajwtr - rcdum
       stat   ndf   ddf F.scaling p.value
Ftest  1.87  2.00 31.00         1    0.17

The p-value is 0.17 > 0.05, so we fail to reject the null hypothesis that the coefficients of ajwtr and rcdum are both zero. The reduction of the model is therefore supported.

  7. It is possible that the increase in prices may not be linear in year. Fit an additive mixed model where a smooth is added for year. Make a plot to show how prices have increased over time.

Answer:

library(mgcv)
mmod3 <- gamm(narsp ~ perypc + regtest + ypc_initial + s(time, k = 9), 
              random = list(msa = ~1), data = hprice) 
summary(mmod3$gam)

Family: gaussian 
Link function: identity 

Formula:
narsp ~ perypc + regtest + ypc_initial + s(time, k = 9)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.24e+00   2.22e-01   10.08  < 2e-16 ***
perypc      -1.12e-02   2.49e-03   -4.50  9.4e-06 ***
regtest      3.56e-02   8.18e-03    4.34  1.9e-05 ***
ypc_initial  1.01e-04   1.37e-05    7.35  1.7e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df   F p-value    
s(time) 2.73   2.73 156  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.74   
  Scale est. = 0.0052007  n = 324
plot(mmod3$gam, residuals = TRUE, select = 1)

  8. Interpret the coefficients in the previous model for the initial annual income, growth and regulation predictors.

Answer:

summary(mmod3$gam)

Family: gaussian 
Link function: identity 

Formula:
narsp ~ perypc + regtest + ypc_initial + s(time, k = 9)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.24e+00   2.22e-01   10.08  < 2e-16 ***
perypc      -1.12e-02   2.49e-03   -4.50  9.4e-06 ***
regtest      3.56e-02   8.18e-03    4.34  1.9e-05 ***
ypc_initial  1.01e-04   1.37e-05    7.35  1.7e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Ref.df   F p-value    
s(time) 2.73   2.73 156  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.74   
  Scale est. = 0.0052007  n = 324

Coefficient of ypc_initial = 1.01e-04. This means that for every 1,000 dollars increase in the initial per capita income, the average house price in thousands of dollars is estimated to increase by \(e^{0.101}-1 = 10.6\%\), controlling for all other variables in the model.

Coefficient of perypc = -1.12e-02. This means that for every \(1\%\) increase in percentage growth in per capita income, the average house price in thousands of dollars is estimated to decrease by \(1-e^{-0.0112} = 1.11\%\), controlling for all other variables in the model.

Coefficient of regtest = 3.56e-02. This means that for every one index increase in the regulation test, the average house price in thousands of dollars is estimated to increase by \(e^{0.0356}-1 = 3.62\%\), controlling for all other variables in the model.
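These percent-change calculations can be verified directly; since narsp is the log of price, a coefficient \(b\) implies a \(100(e^{b}-1)\%\) change per unit of the predictor (coefficients taken from the summary above):

```r
# Percent-change interpretations of the GAMM parametric coefficients
round(100 * (exp(1.01e-04 * 1000) - 1), 2)  # ypc_initial, per $1,000: 10.63
round(100 * (1 - exp(-1.12e-02)), 2)        # perypc, per 1% growth: 1.11
round(100 * (exp(3.56e-02) - 1), 2)         # regtest, per index point: 3.62
```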

5 Optional Extra Credit Problem*

Midterm makeup question

This problem is meant to offer another chance to demonstrate understanding of some of the material on the mid-term. If you choose to do this problem and your score is higher than your mid-term grade, then your mid-term grade will be reweighted to be New Midterm Grade = .8*Old Midterm Grade + .2*Extra Credit Problem